PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.
Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally.
My questions:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
# Plot styles
plt.style.use('fivethirtyeight')
plt.style.use('seaborn-poster')
# We have a lot of columns in our data set
pd.set_option('display.max_columns', 650)
pd.set_option('display.max_rows', 650)
pd.set_option('display.width', 1000)
# Using low_memory=False here to suppress the mixed-dtype warnings
# per: https://stackoverflow.com/a/40585000
df = pd.read_csv('data/pisa2012.csv', low_memory=False, encoding='latin1')
# Import the freedom scores
df_free = pd.read_csv('data/Freedom_in_the_World_2012.csv')
# High level inspection
df.shape
df.sample(5)
df.describe()
df_free.sample(5)
df_free.info()
The set has 485,490 rows representing individual students in the survey. There are 636 columns that contain data about many different factors: the student, where they live, their economic, social, and cultural status, their attitudes about learning, and their performance on a panel of math, reading, and science assessments. Many of the qualitative columns contain string values that represent a scale (e.g. "Never" -> "Every day"). There's a wealth of data here!
The features of interest from the PISA dataset to me are primarily:
NC - country code (there is another code called CNT, but that seems to include sub-regions such as Florida and Massachusetts)
ST04Q01 - the student's gender (Male or Female)
PV1MATH-PV5MATH - plausible values that represent scores on the mathematics section
PV1READ-PV5READ - plausible values that represent scores on the reading section
PV1SCIE-PV5SCIE - plausible values that represent scores on the science section
EC06Q01 - the age at which the student began learning the language the test was given in
ESCS - a single index that combines several economic, social, and cultural status factors. It's expressed as standard deviations above or below the population mean (where 0 = mean). The ESCS index was first used in the PISA 2000 analysis and at that time was derived from five indices: highest occupational status of parents (HISEI), highest educational level of parents (in years of education according to ISCED), family wealth, cultural possessions, and home educational resources (the last three are WLEs based on student reports on home possessions). The ESCS scores were obtained as component scores for the first principal component, with zero being the score of an average OECD student and one being the standard deviation across equally weighted OECD countries.

And from the Freedom House dataset I'll be using:

political_rights - degree of a country's political freedom, on a scale of 1 = most free to 7 = least free
civil_liberties - degree of a country's civil liberties, on a scale of 1 = most free to 7 = least free
freedom_status - a single overall rating of the country's freedom; F, PF, and NF stand for Free, Partly Free, and Not Free

For my assessment, I'll be using ESCS to represent student economic, social, and cultural status. I'll be creating new values that average each set of five math, reading, and science plausible values into a single score per subject. I'll also be looking at these factors against the Freedom House freedom scores.
# Make a copy for cleaning
df_clean = df.copy()
I noticed several of the string values have leading or trailing spaces. The NC (country) column is the only affected column I'll be using.
# Trim trailing/leading spaces from the country names
df_clean['NC'] = df_clean['NC'].str.strip()
Now I'm going to join the Freedom House data.
# Join the Freedom House data, drop the dupe country column
df_clean = pd.merge(df_clean, df_free, left_on='SUBNATIO', right_on='SUBNATIO')
df_clean.drop(columns=['country'], inplace=True)
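Before relying on the join above, it's worth checking that every PISA country actually matched a Freedom House row, since an inner merge silently drops non-matches. A minimal sketch with invented SUBNATIO codes and country names:

```python
import pandas as pd

# Toy stand-ins for the two frames (these SUBNATIO values are made up)
pisa = pd.DataFrame({'SUBNATIO': [8, 32, 36],
                     'NC': ['Albania', 'Argentina', 'Australia']})
free = pd.DataFrame({'SUBNATIO': [8, 36],
                     'civil_liberties': [3, 1]})

# A left merge with indicator=True flags countries that found no
# Freedom House match, which an inner join would silently drop
check = pisa.merge(free, on='SUBNATIO', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'NC'].tolist()
```

Comparing the row count before and after the inner merge is a coarser version of the same check.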
df_clean['civil_liberties'].corr(df_clean['political_rights'])
The civil_liberties and political_rights values are very highly positively correlated, so I'm only going to use civil_liberties in my analysis. A value that reflects individual freedom is most relevant to individual academic performance.
From the PISA Data Visualization Contest instructions: "Pupil performance in mathematics, reading and science is coded by plausible values. You can find them in columns: PV1MATH-PV5MATH (for math), PV1READ-PV5READ (for reading) and PV1SCIE-PV5SCIE (for science). For given area all five values PV1-PV5 are just independent estimations of the student performance in given area. For exploration it is fine to use only PV1."
Based on this, I've decided to use the average of the five plausible values in each subject as that subject's overall "literacy" score. I then compute the average of all fifteen plausible values to stand as an overall literacy score.
# Each score is scaled to a mean of 500 and a standard deviation of 100
# Average of the five MATH assessment scores
df_clean['math_literacy'] = (
(df_clean['PV1MATH'] + df_clean['PV2MATH'] + df_clean['PV3MATH'] +
df_clean['PV4MATH'] + df_clean['PV5MATH']) / 5)
# Average of the five READING assessment scores
df_clean['read_literacy'] = (
(df_clean['PV1READ'] + df_clean['PV2READ'] + df_clean['PV3READ'] +
df_clean['PV4READ'] + df_clean['PV5READ']) / 5)
# Average of the five SCIENCE assessment scores
df_clean['sci_literacy'] = (
(df_clean['PV1SCIE'] + df_clean['PV2SCIE'] + df_clean['PV3SCIE'] +
df_clean['PV4SCIE'] + df_clean['PV5SCIE']) / 5)
# Average of all assessment scores
df_clean['overall_literacy'] = (
(df_clean['PV1MATH'] + df_clean['PV2MATH'] + df_clean['PV3MATH'] +
df_clean['PV4MATH'] + df_clean['PV5MATH'] + df_clean['PV1READ'] +
df_clean['PV2READ'] + df_clean['PV3READ'] + df_clean['PV4READ'] +
df_clean['PV5READ'] + df_clean['PV1SCIE'] + df_clean['PV2SCIE'] +
df_clean['PV3SCIE'] + df_clean['PV4SCIE'] + df_clean['PV5SCIE']) / 15)
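The hand-written sums above work, but the same averages can be computed more compactly with pandas' `filter` and a row-wise `mean`. A minimal sketch on a toy frame (only two of the five plausible values per subject shown):

```python
import pandas as pd

# Toy frame with a subset of the plausible-value columns
df = pd.DataFrame({
    'PV1MATH': [500.0, 400.0], 'PV2MATH': [520.0, 380.0],
    'PV1READ': [480.0, 460.0], 'PV2READ': [500.0, 440.0],
})

# filter(regex=...) selects every PV column for a subject;
# mean(axis=1) averages across those columns row by row
df['math_literacy'] = df.filter(regex=r'^PV\dMATH$').mean(axis=1)
df['read_literacy'] = df.filter(regex=r'^PV\dREAD$').mean(axis=1)
```

This also scales without edits if the number of plausible values ever changes.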
# EC06Q01 = age started learning
df_clean.EC06Q01.unique()
# Convert the EC06Q01 field to an ordered dtype
ordered_var = pd.api.types.CategoricalDtype(ordered=True,
categories=[
'0 to 3 years', '4 to 6 years',
'7 to 9 years',
'10 to 12 years',
'13 years or older'
])
df_clean['EC06Q01'] = df_clean['EC06Q01'].astype(ordered_var)
# Test that the ordered dict dtype worked
df_clean['EC06Q01'].unique()
I'm considering a student "disadvantaged" if their ESCS score is at least one standard deviation below the mean, and "advantaged" if their score is at least one standard deviation above the mean.
df_clean['disadvantaged'] = df_clean['ESCS'].dropna().apply(lambda x: 1
if x <= -1 else 0)
df_clean['advantaged'] = df_clean['ESCS'].dropna().apply(lambda x: 1
if x >= 1 else 0)
# Test new columns
# Expect: -1.0
print(df_clean.query('disadvantaged == 1')['ESCS'].max())
# Expect: 1.0
print(df_clean.query('advantaged == 1')['ESCS'].min())
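An equivalent vectorized alternative to the `apply` calls above, sketched on a toy series: comparisons broadcast across the whole column, and NaN compares False on both sides, so rows with missing ESCS are flagged 0 rather than left as NaN (a slight difference from the `apply` version, which leaves those rows missing).

```python
import numpy as np
import pandas as pd

escs = pd.Series([-1.5, -0.2, 0.0, 1.3, np.nan])

# NaN <= -1 and NaN >= 1 both evaluate False, so missing rows become 0
disadvantaged = (escs <= -1).astype(int)
advantaged = (escs >= 1).astype(int)
```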
Our dataframe is massive, which can make it unwieldy to analyze. We're going to reduce it to only those columns we'll be using in our analysis.
df_clean = df_clean[[
'STIDSTD', 'NC', 'ESCS', 'EC06Q01', 'ST04Q01', 'overall_literacy',
'math_literacy', 'read_literacy', 'sci_literacy', 'civil_liberties',
'disadvantaged', 'advantaged'
]]
df_clean.rename(columns={
'STIDSTD': 'student_id',
'NC': 'country',
'EC06Q01': 'age_start_learn',
'ST04Q01': 'gender'
},
inplace=True)
df_clean.sample(5)
df_clean.info()
# Save a copy of the cleaned dataframe (index=False avoids writing
# an extra unnamed index column)
df_clean.to_csv('data/pisa2012_clean.csv', index=False)
fig, ax = plt.subplots()
fig.set_size_inches(8.5, 11)
df_clean['country'].value_counts(ascending=True).plot(kind="barh", fontsize=11)
ax.set_ylabel("Country", fontsize=14)
ax.set_xlabel("Number of Students", fontsize=14)
plt.show()
fig.savefig("images/students-by-country.png")
fig, ax = plt.subplots()
fig.set_size_inches(8.5, 11)
dt = (df_clean.query('disadvantaged == 1').groupby('country')['student_id'].count() /
df_clean.groupby('country')['student_id'].count()).sort_values() * 100
dt.dropna().plot(kind="barh", fontsize=11)
ax.set_ylabel("Country", fontsize=14)
ax.set_xlabel("Percentage of Students Disadvantaged", fontsize=14)
plt.show()
fig.savefig("images/disadvantaged-students-by-country.png")
fig, ax = plt.subplots()
fig.set_size_inches(8.5, 11)
dt = (df_clean.query('advantaged == 1').groupby('country')['student_id'].count() /
df_clean.groupby('country')['student_id'].count()).sort_values() * 100
dt.dropna().plot(kind="barh", fontsize=11)
ax.set_ylabel("Country", fontsize=14)
ax.set_xlabel("Percentage of Students Advantaged", fontsize=14)
plt.show()
fig.savefig("images/advantaged-students-by-country.png")
df_clean['age_start_learn'].value_counts(normalize=True) * 100
fig, ax = plt.subplots()
sns.countplot(data=df_clean, x='age_start_learn', palette='RdBu_r')
plt.show()
fig.savefig("images/age-started-learning.png")
# Function for plotting multiple histograms, from:
# https://stackoverflow.com/questions/29530355/plotting-multiple-histograms-in-grid
def draw_histograms(df, variables, n_rows, n_cols):
fig = plt.figure()
for i, var_name in enumerate(variables):
ax = fig.add_subplot(n_rows, n_cols, i + 1)
df[var_name].hist(bins=100, ax=ax)
ax.set_title(var_name + " Distribution")
fig.set_size_inches(11, 8)
fig.tight_layout() # Improves appearance a bit.
plt.show()
draw_histograms(df_clean,
['math_literacy',
'read_literacy',
'sci_literacy',
'overall_literacy'],
2, 2)
As expected and as described in the PISA literature, each of these scores follows an approximately normal distribution centered at about 500.
First let's look at a correlation plot for each of the numeric variables in our dataframe.
numeric_vars = [
'ESCS', 'overall_literacy', 'math_literacy', 'read_literacy',
'sci_literacy', 'civil_liberties'
]
# correlation plot
plt.figure(figsize=[10, 10])
sns.heatmap(df_clean[numeric_vars].corr(),
annot=True,
fmt='.3f',
cmap='RdBu',
center=0)
plt.show()
We can see that there is quite a bit of correlation between the math, reading, and science literacy scores. (Of course, each of these is also highly correlated with the overall literacy score, which is derived from them.) ESCS has a medium correlation with the academic scores. Civil liberties appears to be weakly negatively correlated with the academic scores (in other words, as countries become less free - higher scores - literacy scores decline). We don't yet know whether that correlation is statistically significant; we'll determine that later.
# plot matrix: sample 1000 rows so that plots are clearer and
# they render faster
df_samp = df_clean.sample(1000, random_state=42)
g = sns.PairGrid(data=df_samp.dropna(), vars=numeric_vars)
g = g.map_diag(plt.hist, bins=20)
g.map_offdiag(plt.scatter)
plt.show()
fig, ax = plt.subplots()
sns.violinplot(data=df_clean,
x='civil_liberties',
y='overall_literacy',
palette='RdBu_r')
ax.set_ylabel("Overall Literacy", fontsize=14)
ax.set_xlabel("Civil Liberties score (1=most free, 7=least free)", fontsize=14)
plt.show()
fig.savefig("images/overall-literacy-civil-liberties.png")
When we look at overall literacy by civil liberties category (1 = most free, 7 = least free) we can see that the most and least free countries seem to perform the best, with a slight decline from most to least free in the middle range (2-5). This is intriguing, and something we'll examine in more detail later.
fig, ax = plt.subplots()
fig.set_size_inches(11, 8)
sns.scatterplot(data=df_clean,
                x='overall_literacy',
                y='ESCS',
                s=25)
plt.title("ESCS vs. Overall Literacy")
plt.ylabel("ESCS (+/- stdev)")
plt.xlabel("Overall Literacy Score")
plt.show()
When we plot overall literacy scores against ESCS we can see some correlation. We can measure that:
df_clean['ESCS'].corr(df_clean['overall_literacy'])
ESCS has a medium positive correlation with overall literacy.
df_clean['ESCS'].corr(df_clean['civil_liberties'])
ESCS has a weak negative correlation with civil liberties (meaning less free countries tend to have lower ESCS).
fig, ax = plt.subplots()
sns.violinplot(data=df_clean,
x='age_start_learn',
y='read_literacy',
palette='RdBu_r')
ax.set_ylabel("Reading Literacy", fontsize=14)
ax.set_xlabel("Age Started Learning", fontsize=14)
ax.set_title("Reading Literacy by Age Started Learning")
plt.show()
fig.savefig("images/read-literacy-age-started-learning.png")
From a visual inspection, it appears that reading literacy declines the later a student began learning the test language. We might expect that, since students were 15 years old when they took the exam: students in the 13 years or older category would have been studying the test language for no more than two years.
Below I'll perform some regression analysis to determine if this decline is statistically significant.
df1 = pd.DataFrame(df_clean.query('disadvantaged == 1'),
columns=['overall_literacy']).assign(opp='Disadvantaged')
df2 = pd.DataFrame(df_clean.query('advantaged == 0 and disadvantaged == 0'),
columns=['overall_literacy'
]).assign(opp='Neither Advantaged\nnor Disadvantaged')
df3 = pd.DataFrame(df_clean.query('advantaged == 1'),
columns=['overall_literacy']).assign(opp='Advantaged')
cdf = pd.concat([df1, df2, df3])
mdf = pd.melt(cdf, id_vars=['opp'], var_name=['overall_literacy'])
fig, ax = plt.subplots()
fig.set_size_inches(11, 8)
ax = sns.violinplot(x='opp', y='value', data=mdf, palette='RdBu')
ax.set_xlabel(None)
ax.set_ylabel("Overall Literacy")
ax.set_title("Overall Literacy by Opportunity")
plt.show()
fig.savefig("images/overall-literacy-by-opportunity.png")
We can see a stark difference in overall academic literacy by opportunity status: disadvantaged students' scores are shifted well below the scale center of 500, while advantaged students' scores are shifted well above it. We'll test whether this difference is statistically significant.
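One quick way to gauge whether a gap like this could be noise is a two-sample z-statistic: the difference in group means divided by its standard error. A sketch on simulated scores (the means, spread, and sample sizes below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
adv = rng.normal(540, 90, 500)  # simulated "advantaged" scores
dis = rng.normal(460, 90, 500)  # simulated "disadvantaged" scores

# z = difference in means / standard error of that difference
diff = adv.mean() - dis.mean()
se = np.sqrt(adv.var(ddof=1) / len(adv) + dis.var(ddof=1) / len(dis))
z = diff / se
```

A |z| well above 2 would be significant at the 0.05 level; the regression later gives the same answer with proper p-values.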
fig, ax = plt.subplots()
sns.boxplot(data=df_clean,
x='age_start_learn',
y='read_literacy',
palette=['mediumorchid', 'mediumaquamarine'],
hue='gender')
ax.set_ylabel("Reading Literacy", fontsize=14)
ax.set_xlabel("Age Started Learning Language", fontsize=14)
plt.show()
fig.savefig("images/read-literacy-gender-age-started-learning.png")
Here we observe again that reading performance appears to decline the later a student learned the test language. Keep in mind that the test subjects were 15-year-olds, meaning that students who only started learning the test language at 13 years or older had fewer than two years of familiarity with the language at test time.
Male students appear to perform worse overall than female students, and the performance gap between female and male students seems to increase the later students first began learning the language.
fig, ax = plt.subplots()
fig.set_size_inches(12, 10)
sns.scatterplot(data=df_clean.dropna(),
x='read_literacy',
y='ESCS',
palette="RdBu_r",
hue='age_start_learn',
s=50)
ax.set_xlabel("Reading Literacy", fontsize=14)
ax.set_ylabel("ESCS (+/- stdev)", fontsize=14)
plt.show()
We can revisit our ESCS vs. literacy plot, this time looking only at reading literacy and overlaying the age at which the student began learning the language. Several points for 13 years or older sit in the lower left of the plot, performing poorly on reading literacy and falling below the mean ESCS (disadvantaged).
cnt_sort = df_clean.groupby('country')['overall_literacy'].mean().sort_values(
ascending=False)
fig, ax = plt.subplots()
fig.set_size_inches(8.5, 20)
sns.boxplot(data=df_clean,
y='country',
x='overall_literacy',
order=cnt_sort.index.get_level_values('country'),
dodge=False,
palette='RdBu_r',
hue='civil_liberties',
width=0.8,
linewidth=1.5)
ax.set_ylabel("Country", fontsize=12)
ax.set_xlabel("Avg. Overall Literacy Score", fontsize=12)
legend = ax.legend(loc='best', title_fontsize=10).set_title(
    'Civil Liberties\n(1=most free,\n7=least free)')
ax.tick_params(labelsize=12)
plt.show()
fig.savefig("images/overall-literacy-country.png")
cnt_sort = df_clean.groupby('country')['ESCS'].mean().sort_values(
ascending=False).dropna()
fig, ax = plt.subplots()
fig.set_size_inches(8.5, 20)
sns.boxplot(data=df_clean,
y='country',
x='ESCS',
order=cnt_sort.index.get_level_values('country'),
dodge=False,
palette='RdBu_r',
hue='civil_liberties',
width=0.8,
linewidth=1.5)
ax.set_ylabel("Country", fontsize=12)
ax.set_xlabel("ESCS (0=average, +/- stdev)", fontsize=12)
legend = ax.legend(loc='lower right', title_fontsize=10).set_title(
    'Civil Liberties\n(1=most free,\n7=least free)')
ax.tick_params(labelsize=13)
plt.show()
fig.savefig("images/escs-country.png")
ESCS scores seem to be missing for Albania. That made me suspicious that there was an error on my part, especially because Albania comes first alphabetically and its data were at the top of the imported CSV. However, when I investigated further I found that there was good reason for the Albania data to be missing and that this was intentional:
"For example, the reliability of parental occupation data from Albania was subject to scrutiny, resulting in a recommendation that all data dependant on Albania’s parental occupation data (in particular, all data that use the PISA index of economic, social and cultural status [ESCS]) should be deleted from the database and relevant tables." (Source: PISA 2012 Technical Report, p. 280)
df_new = df_clean.copy()
# Add an intercept column
df_new['intercept'] = 1
# Create one-hot encoded columns for Age Started Learning
df_new = df_new.join(pd.get_dummies(df_new['age_start_learn']))
# Create one-hot encoded columns for Gender
df_new = df_new.join(pd.get_dummies(df_new['gender']))
df_new.sample(5)
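A caveat worth noting with one-hot encoding: keeping every level of a categorical plus an intercept makes the design matrix perfectly collinear (the "dummy trap"), which is why the regressions below hold one level out as a baseline. `pd.get_dummies` can do this automatically with `drop_first=True`, sketched here on a toy series:

```python
import pandas as pd

s = pd.Series(['Male', 'Female', 'Female', 'Male'])

# drop_first=True holds out the first level ('Female') as the baseline,
# so the remaining column is interpreted relative to it
dummies = pd.get_dummies(s, drop_first=True)
```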
Here our null hypothesis is that the mean reading performance for each age group at which a student began learning ($\mu_{group}$) is equal to the mean for students who began learning between 0 and 3 years ($\mu_{0-3 years}$). We express that as:
$$\large H_{0}: \mu_{group} = \mu_{0-3 years}$$Our alternative hypothesis is that for individual age groups the mean reading performance is not equal to that of the 0 to 3 years baseline:
$$\large H_{1}: \mu_{group} \neq \mu_{0-3 years}$$Our $\alpha$ (alpha) is 0.05.
# Regression analysis of read_literacy scores
# holding back 0 to 3 years age started learning as baseline
lm = sm.OLS(
df_new['read_literacy'],
df_new[[
'intercept', '4 to 6 years', '7 to 9 years',
'10 to 12 years', '13 years or older'
]],
)
results = lm.fit()
results.summary()
From the above we can reject the null hypothesis: every age group performs worse in reading than the 0 to 3 years baseline, and the penalty grows with the age at which a student began learning. For example, a student who did not begin learning the test language until age 13 or older would be expected to score about 70 points below a student who began learning the language at ages 0 to 3, all other variables held constant.
The reported p-value for every tested age group is 0.000, which is below our $\alpha$ of 0.05.
The less time a child has had to learn the test language, the worse they performed on reading literacy.
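The interpretation of a roughly 70-point penalty follows from how dummy-coded OLS works: the intercept estimates the baseline group's mean and each coefficient estimates a group's mean difference from that baseline. A minimal numpy check with invented group means (520 for the baseline, 450 for a later-learning group):

```python
import numpy as np

# Two groups: baseline (dummy = 0) and a later-learning group (dummy = 1)
y = np.array([520.0, 520.0, 520.0, 450.0, 450.0, 450.0])
X = np.array([[1, 0], [1, 0], [1, 0],
              [1, 1], [1, 1], [1, 1]], dtype=float)  # intercept, dummy

# Least-squares fit: intercept = baseline mean, slope = mean difference
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With real data the groups are not constant, but the coefficients keep this mean-difference reading.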
Here our null hypothesis is that mean overall academic performance is not affected by whether the student is advantaged or disadvantaged; mean performance for those groups is equal to the overall mean performance.
$$\large H_{0}: \mu_{adv/dis} = \mu$$Our alternative hypothesis is that for students who are advantaged or disadvantaged, mean performance is not equal to the mean performance of the overall group:
$$\large H_{1}: \mu_{adv/dis} \neq \mu$$Our $\alpha$ (alpha) is 0.05.
dt = df_new.dropna()
lm = sm.OLS(
dt['overall_literacy'],
dt[[
'intercept', 'disadvantaged', 'advantaged'
]],
)
results = lm.fit()
results.summary()
From our analysis we can reject the null hypothesis and find:
$$\large H_{1} : \mu_{disadvantaged} < \mu$$$$\large H_{1} : \mu_{advantaged} > \mu$$Being advantaged accounts for about +42 points and being disadvantaged for about -38 points, all else being equal. The reported p-values are 0.000, which is below our $\alpha$ of 0.05.
Advantaged children outperform disadvantaged children.
Here our null hypothesis is that mean overall academic performance is not affected by the civil liberties rating of the student's country; mean performance for those students is equal to the overall mean performance.
$$\large H_{0}: \mu_{civ lib} = \mu$$Our alternative hypothesis is that mean performance differs with the country's civil liberties rating:
$$\large H_{1}: \mu_{civ lib} \neq \mu$$Our $\alpha$ (alpha) is 0.05.
lm = sm.OLS(
df_new['overall_literacy'],
df_new[[
'intercept', 'civil_liberties'
]],
)
results = lm.fit()
results.summary()
Here we find that the civil liberties score is negatively associated with overall academic performance: for each +1 on the civil liberties rating we expect the overall literacy score to decline by about 14 points. Remember, our civil liberties scale runs from 1 (most free) to 7 (least free), so children in less free countries perform worse than children in freer societies. The effect is small, but the reported p-value is 0.000, which is below our $\alpha$ of 0.05, so it is statistically significant.
China, a country whose civil liberties are rated 6 of 7 (toward the least free end) on our scale, consistently performs at the top in our academic scores, so I was curious how much the effect would change if we excluded it. We might expect the gap to widen.
# Exclude Mainland China (Shanghai) and Macao, but keep Taiwan and Hong Kong
dt = df_new.query('country != "China (Shanghai)" and country != "Macao-China"')
lm = sm.OLS(
dt['overall_literacy'],
dt[[
'intercept', 'civil_liberties'
]],
)
results = lm.fit()
results.summary()
Indeed, the effect increases when we exclude China: each +1 on the civil liberties scale now reduces the expected overall literacy score by about 20 points.
Students in freer countries perform better academically than children in less free countries.